ABSTRACT

The contention that interpretation of a student's
performance on a criterion-referenced test should be independent of
the performance of his classmates is challenged. The Mastery Learning
Test Model, which was developed for analyzing criterion-referenced
test data, is described. An estimate of the proportion of students in
an instructional group who have achieved the referent objective is
usable as a prior probability in interpreting individual responses.
Considering instructional group performance enhances estimates of
individual performance. Correlational data from a set of test items
and a representative population of students are used to estimate the
required item parameters. (Author)
Using Group Performance to Interpret Individual
Responses to Criterion-Referenced Tests 

There is currently considerable controversy over what constitutes a 
criterion-referenced test. Typically the concept "criterion-referenced"
is defined in relation to norm-referenced or standardized tests. For
example, "norm-referenced measures compare the student's performance
with the mean of a norm group whereas criterion-referenced measures compare
his performance with a specified criterion score" (Livingston, 1972).
On the basis of such definitions it has been claimed that interpretation of
a student's performance on a criterion-referenced test should not be depen- 
dent upon the performance of his classmates or other norm groups. 

"the interpretation of a student's performance in a 

criterion-referenced situation is absolute and axiomatic, 

not dependent upon how other learners perform." 

(Airasian and Madaus, 1972) 


"(criterion-referenced)... measurements are absolute indices 
designed to indicate what the pupil has or has not learned from 
a given instructional segment. The measurements are absolute 
in that they are interpretable solely vis-a-vis a fixed performance
standard or criterion and need not be interpreted
relative to other measurements." 

(Block, 1971) 

It is contended here that norm-group performance is useful and legitimate
information for both the construction and application of criterion-referenced
tests.



A criterion-referenced test is here defined as a set of items sampled 
from a domain which has been judged to be an adequate representation of an 
instructional objective. This definition does not limit criterion-referenced 
tests to narrowly defined behavioral objectives for which an item form (Osburn,
1968) specifies how to generate every item in the domain. It is desirable
that the domain be described in operational terms; using this description,
another test developer should be able to generate an equivalent domain of test
items. The assumptions or theory relating the domain of items to the referent 
objective should be explicitly stated. 

Desirable procedures for selecting a sample of items from a domain depend 
upon the intended application of the test. One application of a criterion-referenced
test is to estimate the proficiency of individual students relative
to some achievement continuum (Kriewall, 1972). This appears to be Glaser's
(1963) original conception of the purpose of a criterion-referenced test. This
application is based on the assumption that "Underlying the concept of achievement
measurement is the notion of a continuum of knowledge acquisition ranging
from no proficiency at all to perfect performance" (Glaser, 1963). For
applications where hand scoring of tests is used, a random or stratified 
random sampling of items from the domain permits the unweighted number-of- 
correct-responses to be interpreted as a degree of proficiency measure. If 
computer scoring is used, a sample of highly discriminating items will yield
a better estimate of proficiency. Thus, the rejection of sampling based on 
item discrimination indices (norm-group performance) is based on the assump- 
tions that a degree-of-proficiency measure is required and that the test must 
be hand-scored. 



A frequent application of criterion-referenced tests is the making of 
categorical mastery, non-mastery decisions for students comprising an instruc- 
tional group. Subsequent instruction for a student is contingent upon the 
category in which he is placed. Typically, test developers have computed a
degree-of-proficiency index and then, most frequently on an arbitrary basis,
selected a critical "passing" score. A problem that arises is that it is
difficult, perhaps impossible, to define a meaningful degree-of-proficiency
index for many types of legitimate instructional objectives. Ebel (1971) 
concludes that "criterion-referenced measurement may be practical in those 
few areas of achievement which focus on cultivation of a high degree of skill 
in the exercise of a limited number of abilities." Ebel's conclusion is based 
on the premise that a degree-of-proficiency scale "...anchored at the extremities--a
score at the top of the scale indicating complete or perfect mastery
of some defined abilities; one at the bottom indicating complete absence of
those abilities" is required. Fortunately, such a measurement scale is not
needed for the categorical decision application. 

The Mastery Learning Test Model 

The Mastery Learning Test Model has been designed to provide an appro- 
priate algorithm for analyzing criterion-referenced test data for making 
the following instructional decision: "Which students have achieved the
referent objective?" Two statistics are computed: the probability that a
given student has achieved the objective and the proportion of an instructional
group that has achieved the objective. The model assumes that each student
in an instructional group can be treated as belonging to one of two groups-- 
a group that has achieved the objective and one that has failed to achieve. 
The two-state assumption does not deny the possibility of partial achievement
of the objective. It does imply that categorization of students into two
groups, masters and non-masters, is the desired type of decision and the
basis for subsequent instruction.

The Mastery Learning Test Model and the true score theory upon which it
is based are derived in an earlier paper (Besel, 1972). This model is related
to a simpler mastery testing model suggested by Emrick (1971). Emrick's model
assumes that measurement error can be accounted for by two test parameters:
α, the probability that a non-master will give a correct answer to an item;
and β, the probability that a master will give an incorrect answer to an
item. His model implicitly assumes that all item difficulties and inter-item
correlations are equal. This assumption can be avoided by increasing the number
of test parameters, either by permitting item α parameters, or item β parameters,
or both.
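
The error structure described above can be sketched in Python. This is a minimal illustration, not the derivation given in Besel (1972) or Emrick (1971): the function name `pattern_likelihoods`, the example values, and the assumption that item responses are independent given a student's mastery state are all introduced here for illustration. The sketch uses item-specific α parameters and a single test β parameter.

```python
from math import prod


def pattern_likelihoods(responses, alphas, beta):
    """Return (P(pattern | master), P(pattern | non-master)).

    responses: list of 1 (correct) or 0 (incorrect), one entry per item
    alphas:    per-item probability that a non-master answers correctly
    beta:      probability that a master answers an item incorrectly
    Assumption (not stated in the source): responses are independent
    given the mastery state.
    """
    if len(responses) != len(alphas):
        raise ValueError("responses and alphas must have the same length")
    # A master answers correctly with probability 1 - beta.
    master = prod((1 - beta) if r else beta for r in responses)
    # A non-master answers item i correctly with probability alphas[i].
    non_master = prod(a if r else (1 - a) for r, a in zip(responses, alphas))
    return master, non_master


# Hypothetical three-item example: two correct answers, one incorrect.
master_like, non_master_like = pattern_likelihoods(
    [1, 1, 0], alphas=[0.3, 0.4, 0.5], beta=0.1
)
```

Setting every entry of `alphas` to the same value gives the equal-parameter case that Emrick's simpler model assumes.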

Parameter Estimation 

Both the α and β item parameters can be estimated from the item response
data collected from a representative sample of students. Two parameter estimation
algorithms have been developed (Besel, 1973, a and b) for a Mastery Learning
Model which has a single test β parameter and item α parameters. Least squares
estimates of the parameters are computed using three classes of empirical
data:

1. Item difficulties,

2. Inter-item covariances,

3. Score histograms.



The first algorithm (Besel, 1973 a) computes the least-squares estimates
using an independent estimate of the proportion of students that have achieved
the referent objective (GMP). The second algorithm requires no input estimate
of GMP; it is estimated from the data in addition to the α parameters.

The stability of the parameter estimates was evaluated, for each algorithm, 
using test data from the end-of-unit Criterion Exercises of the SWRL Beginning
Reading Program. Data from two consecutive years (1970-71 and 1971-72) were 
sampled from schools participating in the Quality Assurance Tryout. Each 
Criterion Exercise measures the achievement of four objectives: (1) Storybook 
Words, (2) Program Word Elements, (3) Word Attack (novel words), and (4) Letter 
Names. Five three-option, multiple-choice items are used for each objective.
Data from all ten units of the program were analyzed; the sample sizes shrank 
from 263 to 98 for the first year and from 418 to 173 for the second year. 

The means and variances of the differences between the parameter estimates 
for the two years were examined (see Table 1). Computations were made for
item α, average α (ᾱ), and test β. For the "Fixed GMP" algorithm two estimates
of GMP were used. The first estimate was the proportion of students
scoring 80% (4 right out of 5) or better for the outcome. The second estimate
was the proportion with a perfect score. The item α differences are
based on 50 items, average α and test β on 10 tests. The mean differences
could be due partially to systematic differences in the student populations. 
Different school districts were represented in the two samples. The variances 
are more appropriate estimates of parameter stability. 
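
The two input estimates of GMP supplied to the "Fixed GMP" algorithm are simple proportions, which the following Python sketch illustrates. The function name `criterion_gmp` and the score list are hypothetical; only the two cut-offs (4 right out of 5, and a perfect score) come from the description above.

```python
def criterion_gmp(scores, n_items=5, criterion=0.8):
    """Proportion of students whose number-correct score meets a criterion.

    scores:    number-correct score per student on one objective
    n_items:   items per objective (five in the Criterion Exercises)
    criterion: required proportion correct (0.8 or 1.0 in the text)
    """
    if not scores:
        raise ValueError("at least one score is required")
    # Small tolerance guards against floating-point error in the cut-off.
    cutoff = criterion * n_items - 1e-9
    passed = sum(1 for s in scores if s >= cutoff)
    return passed / len(scores)


# Hypothetical scores for eight students on one five-item objective.
scores = [5, 4, 3, 5, 2, 4, 1, 5]
gmp_80 = criterion_gmp(scores, criterion=0.8)    # 4 or more correct
gmp_100 = criterion_gmp(scores, criterion=1.0)   # perfect score only
```

Because every perfect score also meets the 80% criterion, the perfect-score estimate of GMP can never exceed the 80% estimate for the same group.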
For the second algorithm (GMP not fixed) the variances vary considerably
across outcomes. The "fixed-GMP" algorithm achieved uniformly better stability,
with the perfect score criterion noticeably better than the 80% criterion.
The variances for both item α and average α decreased as the difficulty of the
objective increased. Letter names is the easiest objective, word attack the
most difficult. The variances of test β, on the other hand, increased as the
difficulty of the objective increased. This trend was apparent in all three
sets of calculations for both algorithms. This result is consistent with the
notion that ideally one would like to estimate β from the responses of a group
all of whom have achieved the objective. Likewise the item alphas could be
"best" estimated from a group none of whom have achieved the objective. When
a mixed group is used, β is estimated most accurately when a high proportion
of the group has achieved the objective. Lowering the GMP of the norm group
improves the accuracy of the α estimates at the expense of β accuracy.

Prior Probabilities 

The Mastery Learning Model is a Bayesian statistical model. The response 
of a student to an item from the test is used to modify an existing probability 
estimate that the student has achieved the referent objective. A Bayesian 
model requires an initial prior probability estimate. One estimate which 
results in better probability-of-mastery estimates than an "ignorance" (prior
probability equal to .5) assumption is the estimated proportion of students in
mastery (GMP) for the appropriate instructional group. If the test parameters
have been previously estimated for a representative norm group, Equation 1
(Besel, 1972) can be used to estimate GMP.



$$\mathrm{GMP} = \frac{U/K - \bar{\alpha}}{1 - \bar{\alpha} - \beta} \qquad (1)$$

GMP = proportion of students in mastery state
ᾱ = average of item α parameters
β = average of item β parameters or test β parameter
U/K = mean percentage score (expressed as a proportion)

While the use of group-estimated priors is somewhat controversial for selection 
decisions across instructional groups (Novick, 1970), it promises to enhance 
instructional decisions within an instructional group. 
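
A minimal Python sketch of how a group estimate can serve as the prior follows. It assumes that Equation 1 has the form GMP = (U/K - ᾱ) / (1 - ᾱ - β), which is what the two-state model implies when the expected proportion correct is GMP(1 - β) + (1 - GMP)ᾱ. The function names, the clamping of GMP to the interval [0, 1], and the item-by-item update under conditional independence are illustrative assumptions, not the published algorithm.

```python
def estimate_gmp(mean_prop_correct, alpha_bar, beta):
    """Estimate the group mastery proportion from the mean proportion correct.

    Assumed form of Equation 1: (U/K - alpha_bar) / (1 - alpha_bar - beta).
    """
    denom = 1.0 - alpha_bar - beta
    if denom <= 0:
        raise ValueError("requires alpha_bar + beta < 1")
    gmp = (mean_prop_correct - alpha_bar) / denom
    # Clamping is an assumption: sampling error can push the raw value
    # slightly outside the range of a valid proportion.
    return min(1.0, max(0.0, gmp))


def update_mastery(prior, correct, alpha, beta):
    """Bayes' rule for one item response; returns P(master | response)."""
    like_master = (1 - beta) if correct else beta
    like_non_master = alpha if correct else (1 - alpha)
    numerator = prior * like_master
    return numerator / (numerator + (1 - prior) * like_non_master)


def posterior_mastery(prior, responses, alphas, beta):
    """Apply update_mastery item by item, starting from the group prior."""
    p = prior
    for r, a in zip(responses, alphas):
        p = update_mastery(p, r, a, beta)
    return p


# Hypothetical values: group mean of 75% correct, alpha_bar = .35, beta = .10.
prior = estimate_gmp(0.75, alpha_bar=0.35, beta=0.10)
p_master = posterior_mastery(prior, [1, 1, 0, 1, 1], [0.35] * 5, beta=0.10)
```

With these hypothetical values the group prior is about .73, compared with .5 under the "ignorance" assumption, so the same five responses yield a higher probability of mastery for a student in this group than they would with a flat prior.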

Summary 

The use of an independent estimate of the proportion of students in a
norm group who have achieved an objective resulted in significantly improved
stability of Mastery Learning parameters. This should result in increased
validity of the Mastery Learning Test Model for making categorical mastery/non-mastery
decisions. This Test Model can be used to make mastery decisions
on the basis of very short tests. Using the proportion-in-mastery estimate 
for an instructional group as a prior-probability results in improved estimates 
of the probability that an individual student has achieved the objective. 
Norm group data can also be used to select the best set of items from a domain 
for the mastery decision application. 



Table 1. Stability of Mastery Learning Parameters
(Mean Difference / Variance of Difference)

Outcome  Parameter   Minimum Sum of        80% Criterion         100% Criterion
                     Squares Solution      Solution              Solution

1        Item α      -.081 / [illegible]   -.026 / .0191         -.013 / .0076
         Average α   -.081 / [illegible]   -.026 / .0031         -.013 / .0008
         Test β       .018 / .0006         -.002 / .0002         -.004 / .0001

2        Item α      -.059 / .0126         -.042 / .0170         -.041 / .0072
         Average α   -.059 / .0033         -.042 / .0015         -.041 / .0004
         Test β      -.003 / .0004         -.007 / .0005         -.006 / .0001

3        Item α      -.037 / .0083         -.032 / .0096         -.020 / .0043
         Average α   -.037 / .0011         -.032 / [illegible]   -.020 / .0007
         Test β      -.000 / .0006         -.001 / .0006         -.003 / .0001

4        Item α       .052 / .0956         -.026 / .0354         -.036 / .0080
         Average α    .052 / .0418         -.026 / .0101         -.036 / .0010
         Test β      -.004 / .0002         -.006 / [illegible]   -.004 / .0000